This document provides an overview of the bayesOmic package that will be available at Bioconductor (http://bioconductor.org/). The package implements a Bayesian approach for omic association studies. We propose a shared-component model to tease out the feature information (e.g. SNPs, CNVs, genes, transcripts, CpGs) that is common to cases and controls from the one that is specific to cases. This allows to detect the SNPs (bayesSNPassoc function) or any continuos feature (bayesOmicAssoc function) that show the strongest association with the disease. The model can nevertheless be applied to more than one disease. More detailed information about the model and assumptions are given in J. J. Abellan, Abellan, and Gonzalez (2010) and González, Abellán, and Abellán (2012) in the case of analyzing SNPs or CNVs, respectively. We illustrate how to analyze SNP data by using a synthetic data set of a targeted study (e.g. a selection of SNPs), GWAS data comparing different populations from 1000 Genomes project (To be supplied) and expression data (i.e. a RangedSummarizedExperiment object for an RNA-Seq experiment) on four human airway smooth muscle cell lines.
The bayesOmic package uses JAGS, a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation Plummer and others (2003), to estimate model parameters. The current implementation of bayesOmic is based on JAGS version 4.3.0. JAGS has an R interface rjags that is used by the package (rjags version 4-9).
The development version of bayesOmic can be installed from our GitHub repository
library(devtools)
install_github("isglobal-brge/bayesOmic")
Then, the package is loaded as usual
library(bayesOmic)
Let us illustrate how to apply our model to a GWAS study as described in J. J. Abellan, Abellan, and Gonzalez (2010). The examples correspond to a simulated data including information about 72 SNPs. It contains information of 800 individuals divided in 4 populations: controls and 3 type of cases (M1, M2, and M3). We simulated SNPs that are specific from gene 1 to be associated with M1 and SNPs from gene 3 to M3.
Let us start by describing how to get the data into R
Data can be imported from a text file or can be loaded using snpMatrix package (to be supplied). We provide a simulated example that can be loaded by typing:
data(sim.data)
This is a simulated data having 72 SNPs and one column with case/control status
dim(sim.data)
[1] 800 73
sim.data[1:5, 1:5]
caco snp74519580 snp30066728 snp32588600 snp57373945
1 Control AB AA BB AB
2 Control AB AB AA AB
3 Control AA AB AB AA
4 Control AB AB AA AA
5 Control AB AA AA AA
The package requires to have the case-control status and the SNPs in to different objects:
y <- sim.data$caco
SNPs <- sim.data[, -1]
It is worth to notice that GWAS data is normally in PLINK format. The Genomic Data Storage format is ….. To be supplied (modify function)
The model can be fitted by using the function bayesSNPassoc as follows:
mod <- bayesSNPassoc(group = y, data=SNPs, sep.allele="", method="JAGS")
The function bayesSNPassoc prepares the data (NOTE: it aggregates the number of rare alleles by the grouping variable. When SNP data is a data.frame it uses SNPassoc package to get the number of rare alleles. In that case the sep.allele argument defines how alleles are separated - sep.allele="" is the default value). Then the function calls rjags to estimate model parameters. We have set up the following default arguments to be passed through rjags functions:
args(bayesSNPassoc)
function (group, data, sep.allele = "", annotation, chr, call.rate = 0.9,
min.freq = 0.05, sig.level = 0.05, method = "inla", n.iter.burn.in = 1000,
n.iter = 5000, thin = 10, n.chain = 2, ...)
NULL
Notice that other arguments realated to MCMC estimation using JAGS can be passed through this function. More details about them can be obtained at http://calvin.iarc.fr/ martyn/software/jags/.
The function also has two arguments to perform a simple quality control. The argument call.rate stands for the call rate of a given individual (e.g. proportion of non-missing SNPs, default is 0.9) and min.freq provides the minimum allele frequency allowed for a given SNP (default is 0.05).
Before interpreting the simulations obtained from de a posteriori distri bution (e.g. model parameters), Markov chains convergence migth be verified. This can be done by using the function checkConvergence. This function has an argument called type that defines the type of plot to be obtained. When type="Markov chain" (default value) the function calls to plot.mcmc() function from package coda Ontherwise, Gelman-Rubin plots are displayed. The function checkConvergence has another argument, parameter, to indicate the model parameter to be summaryzed. The default is alpha. For example, Figure can be obtained by executing:
checkConvergence(mod)
Figure 1: Chains convergence of alpha parameter for SNP model of ‘sim.data’
Other model parameters (Figure \ref{fig:checkConvergence2) are summaryzed by changing the argument called parameter.
checkConvergence(mod, parameter = "beta")
Figure 2: Chains convergence of alpha parameter for SNP model of ‘sim.data’
The trace plot can be obtained by
checkConvergence(mod, type = "trace")
Figure 3: Chains convergence of alpha parameter for SNP model of ‘sim.data’ using Gelman-Rubin method
The JAGS estimation can be very time consuming. INLA (Integrated Nested Laplace Approximation) can be used instead
mod.inla <- bayesSNPassoc(group = y, data=SNPs, sep.allele="", method="INLA")
The print() and plot() functions also work with this method
This are the shared components
plot(mod.inla, type="shared")
and this the specific
plot(mod.inla, type="specific")
Model parameters (intercept and shared component) can be obtained by using print function:
mod
Intercepts (alpha):
2.5% 50% 97.5%
Control -0.8453899 -0.8692369 -0.8454839 -0.8212086
M1 -0.7903692 -0.8410633 -0.7895358 -0.7412849
M2 -0.8351066 -0.8612506 -0.8348243 -0.8094776
M3 -0.7754098 -0.8237909 -0.7770748 -0.7237574
Coefficients of shared components (beta):
2.5% 50% 97.5%
M1 1.4383355 1.235339e-08 0.31073920 11.639091
M2 0.4755054 3.245502e-10 0.08822252 2.736444
M3 1.8677611 6.676204e-09 0.39134913 13.082979
Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively.
On the other hand, specific and shared components can be visualize by:
plot(mod, type="specific")
and
plot(mod, type = "shared")
respectively.
Finally, a hierarchical clustering can be performed by using the predicted probabilities by typing:
makeHeatmap(mod)
The figure shows a Heatmap were we can observe that groups M1 and M3 are different from controls and group M2.
We provide a real data example that can be loaded by typing:
data(armengol)
Model parameter estimates are obtained using the function bayesOmicAssoc by executing:
mod.CNV <- bayesOmicAssoc(group="pop", data=armengol)
The intercept ans coefficient of shared components of populations can be obtained by:
mod.CNV
Intercepts (alpha):
2.5% 50% 97.5%
CEU 1.963230 1.900412 1.962320 2.036787
CHB 1.980471 1.928113 1.981505 2.023672
YRI 1.994390 1.948687 1.995756 2.034592
Coefficients of shared components (beta):
2.5% 50% 97.5%
CEU -0.003547929 -0.34902437 0.01602909 0.37232342
CHB 0.012235748 -0.18975163 0.04219106 0.19559769
YRI -0.007897079 -0.09120255 -0.01331183 0.06396489
Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively.
We can check model convergence by
checkConvergence(mod.CNV)
checkConvergence(mod.CNV, type="trace")
The shared components are
plot(mod.CNV, type="shared")
The Specific components for each population can be obtained by typing:
plot(mod.CNV, type="specific")
We can obtain the name of the specific features of a given population by
getSpecific(mod.CNV, group="YRI")
0.02083333% 50% 99.97917% sig
A_14_P133280 0.18845241 0.3757252 0.551398456 1
Chr16_70653608 0.16521939 0.3567767 0.517470120 1
Chr22_22690592 -1.08263271 -0.9346243 -0.703315278 -1
Chr3_46771035 -1.23736689 -1.0565910 -0.815555478 -1
Chr3_164046699 -0.56913859 -0.3663315 -0.147128074 -1
Chr6_32594054 -1.44036759 -1.2597585 -1.018248402 -1
Chr7_141965669 0.23258746 0.5614095 0.702714180 1
Chr1_193470134 -1.26381352 -0.9798692 -0.748392943 -1
Chr9_Pop_1 0.12200154 0.3227919 0.495663848 1
Chr15_Pop_1 0.05656680 0.2363705 0.413002858 1
Chr1B_Pop_1 0.61187254 0.9533931 1.137132224 1
Chr16_Pop_1 0.11959767 0.2886871 0.440029242 1
Chr13_Pop_2 -1.26638780 -1.0673508 -0.860049649 -1
Chr15_Pop_2 -0.70157723 -0.4421261 -0.155667714 -1
Chr1B_Pop_2 -1.25738914 -1.0607002 -0.806091356 -1
Chr22_Pop_2 0.18955438 0.3759284 0.559668906 1
Chr16_Pop_2 0.39454559 0.7717614 0.961744645 1
Chr14_Pop_1 -0.38825197 -0.2424276 -0.001948421 -1
Chr14_Pop_2 0.01802474 0.2249488 0.406063662 1
chr17_420_B -0.69969064 -0.4758211 -0.270970698 -1
A_14_P127743 0.11448146 0.4845810 0.643373741 1
A_14_P126481 0.11941396 0.5329414 0.766685591 1
chr17_335_A 0.68280135 0.8967815 1.088850265 1
chr1_723_B -0.95357575 -0.7694218 -0.373258176 -1
chr17_576_B -1.02370573 -0.7175496 -0.481911087 -1
chr19_8743145_8784792_B -0.87422378 -0.6373753 -0.366011338 -1
chr17_42973027_43038163_A -0.97373829 -0.7655191 -0.563640810 -1
chr5_150_A -0.96944956 -0.6375767 -0.389076625 -1
chr5_150_B -1.07444547 -0.9112062 -0.720817903 -1
chr7_143_A -0.43812801 -0.2659312 -0.053226534 -1
chr12_111_A -0.61933486 -0.4481253 -0.071816101 -1
chr12_111_B -1.00907552 -0.8049506 -0.590159226 -1
chr16_543_A 0.75241065 0.9579340 1.134285249 1
chr17_415_A 0.04738639 0.2986104 0.503667435 1
chr5_704_A 0.51322367 0.7545153 0.968261105 1
chr5_704_B 0.56210398 0.8396936 1.063031532 1
chr17_429_A2 0.09920391 0.4137488 0.743640914 1
chr17_429_B2 0.54696027 0.7407740 0.993223187 1
The heatmap is obtained by:
makeHeatmap(mod.CNV, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))
Dataset can be loaded by typing:
library(airway)
data(airway)
We can analyse a specific genomic range of interest (i.e. chromosome 1, positions 1 to 10000000). We define the roi by executing:
library(GenomicRanges)
library(IRanges)
roi <- GRanges(seqnames = "1", ranges= IRanges(1,10000000))
Model parameter estimates are obtained using the function bayesOmicAssoc with the argument roi by executing:
mod.Expr <- bayesOmicAssoc(group="dex", data=airway, roi = roi)
The intercept ans coefficient of shared components of populations can be obtained by:
mod.Expr
Intercepts (alpha):
2.5% 50% 97.5%
trt 65.38641 59.02911 64.70174 73.11770
untrt 67.72806 60.76016 67.39973 74.87659
Coefficients of shared components (beta):
2.5% 50% 97.5%
trt -33.630672 -1173.774 -33.64596 1106.829
untrt -9.640538 -1133.550 -10.45542 1116.820
Use 'getSpecific()' and 'getShared()' functions to get specific or shared components, respectively.
We can check model convergence by
checkConvergence(mod.Expr)
checkConvergence(mod.Expr, type="trace")
The shared components are
plot(mod.Expr, type="shared")
The Specific components for each population can be obtained by typing:
plot(mod.Expr, type="specific")
We can obtain the name of the specific features of a given population by
getSpecific(mod.Expr, group="trt")
0.007598784% 50% 99.9924% sig
ENSG00000074800 2029.6426 2212.090 2370.688 1
ENSG00000116285 852.5968 2377.657 3856.968 1
The heatmap is obtained by:
makeHeatmap(mod.Expr, quantiles=c(.025, 0.2, 0.5, 0.8, 0.975))
This work has been partly supported by ….
Abellan, Juan J, Carlos Abellan, and Juan R Gonzalez. 2010. “A Bayesian Shared Component Model for Genetic Association Studies.” bepress.
González, Juan R, Carlos Abellán, and Juan J Abellán. 2012. “Bayesian Model to Detect Phenotype-Specific Genes for Copy Number Data.” BMC Bioinformatics 13 (1). BioMed Central: 130.
Plummer, Martyn, and others. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124:10. 125. Vienna, Austria.